%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "courseml"
%%% End: 

\chapter{Expectation Maximization} \label{sec:em}

\chapterquote{A hen is only an egg's way of making another egg.}{Samuel~Butler}

\begin{learningobjectives}
\item Explain the relationship between parameters and hidden
  variables.
\item Construct generative stories for clustering and dimensionality
  reduction.
\item Draw a graph explaining how EM works by constructing convex
  lower bounds.
\item Implement EM for clustering with mixtures of Gaussians, and
  contrasting it with $k$-means.
\item Evaluate the differences betweem EM and gradient descent for
  hidden variable models.
\end{learningobjectives}

\dependencies{}

\newthought{Suppose you were building} a naive Bayes model for a text
categorization problem.  After you were done, your boss told you that
it became prohibitively expensive to obtain labeled data.  You now
have a probabilistic model that assumes access to labels, but you
don't have any labels!  Can you still do something?

Amazingly, you can.  You can treat the labels as \concept{hidden
  variables}, and attempt to learn them at the same time as you learn
the parameters of your model.  A very broad family of algorithms for
solving problems just like this is the \concept{expectation
  maximization} family.  In this chapter, you will derive expectation
maximization (EM) algorithms for clustering and dimensionality reduction,
and then see why EM works.

\section{Grading an Exam without an Answer Key}

Alice's machine learning professor Carlos gives out an exam that consists of 50 true/false questions.
Alice's class of 100 students takes the exam and Carlos goes to grade their solutions.
If Carlos made an answer key, this would be easy: he would just count the fraction of correctly answered questions each student got, and that would be their score.
But, like many professors, Carlos was really busy and didn't have time to make an answer key.
Can he still grade the exam?

There are two insights that suggest that he might be able to.
Suppose he know ahead of time that Alice was an awesome student, and is basically guaranteed to get 100\% on the exam.
In that case, Carlos can simply use Alice's answers as the ground truth.
More generally, if Carlos assumes that \emph{on average} students are better than random guessing, he can hope that the majority answer for each question is likely to be correct.
Combining this with the previous insight, when doing the ``voting'', he might want to pay more attention to the answers of the better students.

To be a bit more pedantic, suppose there are $N=100$ students and $M=50$ questions.
Each student $n$ has a score $s_n$, between $0$ and $1$ that denotes how well they do on the exam.
The score is what we really want to compute.
For each question $m$ and each student $n$, the student has provided an answer $a_{n,m}$, which is either zero or one.
There is also an unknown ground truth answer for each question $m$, which we'll call $t_m$, which is also either zero or one.

As a starting point, let's consider a simple heuristic and then complexify it.
The heuristic is the ``majority vote'' heuristic and works as follows.
First, we estimate $t_m$ as the most common answer for question $m$:
$t_m = \argmax_t \sum_n \Ind[a_{n,m} = t]$.
Once we have a guess for each true answer, we estimate each students' score as how many answers they produced that match this guessed key:
$s_n = \frac 1 M \sum_m \Ind[a_{n,m} = t_m]$.

Once we have these scores, however, we might want to trust some of the students more than others.
In particular, answers from students with high scores are perhaps more likely to be correct, so we can \emph{recompute} the ground truth, according to weighted votes.
The weight of the votes will be precisely the score the corresponding each student:
%
\begin{align}
  t_m = \argmax_t \sum_n s_n \Ind[a_{n,m} = t]
\end{align}
%
You can recognize this as a \emph{chicken and egg problem}. If you knew the student's scores, you could estimate an answer key. If you had an answer key, you could compute student scores.
A very common strategy in computer science for dealing with such chicken and egg problems is to iterate.
Take a guess at the first, compute the second, recompute the first, and so on.

In order to develop this idea formally, we have to case the problem in terms of a probabilistic model with a generative story.
The generative story we'll use is:
%
\begin{enumerate}
  \item \textcolor{darkpurple}{For each question $m$, choose a true answer $t_m \sim \Ber(0.5)$}
  \item \textcolor{darkblue}{For each student $n$, choose a score $s_n \sim \Uni(0,1)$}
  \item \textcolor{darkred}{For each question $m$ and each student $n$, choose an answer\\ $a_{n,m} \sim \Ber(s_n)^{t_m}  \Ber(1-s_n)^{1-t_m}$}
\end{enumerate}
~
In the first step, we generate the true answers independently by flipping a fair coin.
In the second step, each students' overall score is determined to be a uniform random number between zero and one.
The tricky step is step three, where each students' answer is generated for each question.
Consider student $n$ answering question $m$, and suppose that $s_n = 0.9$.
If $t_m = 1$, then $a_{n,m}$ should be 1 (i.e., correct) 90\% of the time;
this can be accomplished by drawing the answer from $\Ber(0.9)$.
On the other hand, if $t_m = 0$, then $a_{n,m}$ should 1 (i.e., incorrect) 10\% of the time;
this can be accomplished by drawing the answer from $\Ber(0.1)$.
The exponent in step 3 selects which of two Bernoulli distributions to draw from, and then implements this rule.

This can be translated into the following likelihood:
~
\begin{align}
  &p(\vec a, \vec t, \vec s) \nonumber\\
  &=      \textcolor{darkpurple}{\left[ \prod_m 0.5^{t_m} 0.5^{1-t_m} \right]} \times
            \textcolor{darkblue}{\left[ \prod_n 1 \right]} \nonumber\\
  &\qquad \times    \textcolor{darkred}{\left[ \prod_n\prod_m 
    s_n^{a_{n,m}t_m}
    (1-s_n)^{(1-a_{n,m})t_m} \right.} \nonumber\\
  &\qquad\qquad\textcolor{darkred}{\left. s_n^{(1-a_{n,m})(1-t_m)}
    (1-s_n)^{a_{n,m}(1-t_m)}
    \right]} \\
 &= \textcolor{darkpurple}{0.5^M}
   \textcolor{darkred}{
\prod_n\prod_m 
    s_n^{a_{n,m}t_m}
    (1-s_n)^{(1-a_{n,m})t_m}
   s_n^{(1-a_{n,m})(1-t_m)}
    (1-s_n)^{a_{n,m}(1-t_m)}
   }
\end{align}

Suppose we knew the true lables $\vec t$. We can take the log of this likelihood and differentiate it with respect to the score $s_n$ of some student (note: we can drop the $0.5^M$ term because it is just a constant):
~
\begin{align}
\log p(\vec a, \vec t, \vec s)
&= \sum_n \sum_m \Big[
     a_{n,m}t_m \log s_n
 +   (1-a_{n,m})(1-t_m) \log (s_n) \nonumber \\
&+   (1-a_{n,m})t_m \log (1-s_n)
 +   a_{n,m}(1-t_m) \log (1-s_n) \Big]\\
\frac {\partial \log p(\vec a, \vec t, \vec s)} {\partial s_n}
&= \sum_m \Big[
     \frac {a_{n,m} t_m + (1-a_{n,m}) (1-t_m)} {s_n}
   - \frac {(1-a_{n,m}) t_m + a_{n,m} (1-t_m)} {1-s_n} \Big]
\end{align}
%
The derivative has the form $\frac A {s_n} - \frac B {1-s_n}$. If we set this equal to zero and
solve for $s_n$, we get an optimum of $s_n = \frac A {A+B}$. In this case:
%
\begin{align}
  A &= \sum_m \big[ a_{n,m} t_m + (1-a_{n,m}) (1-t_m) \big]\\
  B &= \sum_m \big[ (1-a_{n,m}) t_m + a_{n,m} (1-t_m) \big]\\
A+B &= \sum_m \big[ 1 \big] = M
\end{align}
%
Putting this together, we get:
%
\begin{align}
  s_n &= \frac 1 M \sum_m \big[ a_{n,m} t_m + (1-a_{n,m}) (1-t_m) \big] 
\end{align}
%
In the case of known $t$s, this matches exactly what we had in the heuristic.

However, we do not know $t$, so instead of using the ``true'' values of $t$, we're going to use their \emph{expectations}.
In particular, we will compute $s_n$ by maximizing its likelihood under the \emph{expected} values of $t$, hence the name \concept{expectation maximization}.
%
If we are going to compute expectations of $t$, we have to say: expectations according to which probability distribution?
We will use the distribution $p(t_m \| \vec a, \vec s)$.
Let $\tilde t_m$ denote $\Ep_{t_m \sim p(t_m \| \vec a, \vec s)}[t_m]$.
Because $t_m$ is a binary variable, its expectation is equal to it's probability;
namely: $\tilde t_m = p(t_m \| \vec a, \vec s)$.

How can we compute this?
We will compute $C = p(t_m=1, \vec a, \vec s)$ and $D = p(t_m=0, \vec a, \vec s)$ and then compute $\tilde t_m = C / (C+D)$.
The computation is straightforward:
%
\begin{align}
  C &= 0.5 \prod_n s_n^{a_{n,m}} (1-s_n)^{1-a_{n,m}} 
    &= 0.5 \prod_{\substack{n : \\ a_{n,m} = 1}} s_n \prod_{\substack{n : \\ a_{n,m} = 0}} (1-s_n) \\
  D &= 0.5 \prod_n s_n^{1-a_{n,m}} (1-s_n)^{a_{n,m}}
    &= 0.5 \prod_{\substack{n : \\ a_{n,m} = 1}} (1-s_n) \prod_{\substack{n : \\ a_{n,m} = 0}} s_n
\end{align}
%
If you inspect the value of $C$, it is basically ``voting'' (in a product form, not a sum form) the scores of those students who agree that the answer is $1$ with one-minus-the-score of those students who do not.
The value of $D$ is doing the reverse.
This is a form of multiplicative voting, which has the effect that if a given student has a perfect score of $1.0$, their results will carry the vote completely.

We now have a way to:
\begin{enumerate}
\item Compute expected ground truth values $\tilde t_m$, given scores.
\item Optimize scores $s_n$ given expected ground truth values.
\end{enumerate}
The full solution is then to alternate between these two.
You can start by initializing the ground truth values at the majority vote (this seems like a safe initialization).
Given those, compute new scores.
Given those new scores, compute new ground truth values.
And repeat until tired.

In the next two sections, we will consider a more complex unsupervised learning model for clustering,
and then a generic mathematical framework for expectation maximization, which will answer questions like: will this process converge, and, if so, to what?

% at + (1-a)(1-t) = at + 1 - a - t + at =  2at + 1 - a -t
% (1-a)t + a(1-t) = t - at + a - at     = -2at + a + t
% z/x - y/(1-x) = 0
% <=> z/x = y/(1-x)
% <=> (1-x) z = x y
% <=> z - xz = xy
% <=> z = x(z + y)
% <=> x = z / (z+y)

\section{Clustering with a Mixture of Gaussians}

In Chapter~\ref{sec:prob}, you learned about probabilitic models for
classification based on density estimation.  Let's start with a fairly
simple classification model that \emph{assumes} we have labeled data.
We will shortly remove this assumption.  Our model will state that we
have $K$ classes, and data from class $k$ is drawn from a Gaussian
with mean $\vec\mu_k$ and variance $\si^2_k$.  The choice of classes
is parameterized by $\vec\th$.  The generative story for this model
is:
%
\begin{enumerate}
  \item \textcolor{darkpurple}{For each example} $n = 1 \dots N$:
    \begin{enumerate}
      \item \textcolor{darkblue}{Choose a label} $y_n \sim \Disc(\vec\th)$
      \item \textcolor{darkred}{Choose example} $\vx_n \sim \Nor(\vec\mu_{y_n}, \si^2_{y_n})$
     \end{enumerate}
\end{enumerate}
%
This generative story can be directly translated into a likelihood
as before:
%
\begin{align}
  p(D)
  &= \prod_n \textcolor{darkblue}{\Mult(y_n \| \vec\th)} \textcolor{darkred}{\Nor(\vx_n \| \vec\mu_{y_n}, \si^2_{y_n})}\\
  &=  \overbrace{
      \prod_n
       \underbrace{\th_{y_n}}_{\textsf{\textcolor{darkblue}{choose label}}}
         \underbrace{
         \left[2 \pi \si_{y_n}^2\right]^{-\frac D 2}
         \exp\left[
           -\frac 1 {2 \si_{y_n}^2} \norm{\vx_n - \vec\mu_{y_n}}^2
           \right]}_{\textsf{\textcolor{darkred}{choose feature values}}}}
       ^{\textsf{\textcolor{darkpurple}{for each example}}}
\end{align}
%
If you had access to labels, this would be all well and good, and you
could obtain closed form solutions for the maximum likelihood
estimates of all parameters by taking a log and then taking gradients
of the log likelihood:
%
\begin{align}
\th_k &= \text{fraction of training examples in class $k$} \\
&= \frac 1 N \sum_n [y_n = k] \nonumber\\
\vec\mu_k &= \text{mean of training examples in class $k$} \\
&= \frac {\sum_n [y_n = k] \vx_n} {\sum_n [y_n = k]} \nonumber\\
\si^2_k &= \text{variance of training examples in class $k$} \\
&= \frac {\sum_n [y_n = k] \norm{\vx_n-\mu_k}} {\sum_n [y_n = k]} \nonumber
\end{align}
%
\thinkaboutit{You should be able to derive the maximum likelihood solution results formally by now.}
%
Suppose that you \emph{don't} have labels.  Analogously to the
$K$-means algorithm, one potential solution is to iterate.  You can
start off with guesses for the values of the unknown variables, and
then iteratively improve them over time.  In $K$-means, the approach
was the \emph{assign} examples to labels (or clusters).  This time,
instead of making hard assignments (``example $10$ belongs to cluster
$4$''), we'll make \concept{soft assignments} (``example $10$ belongs
half to cluster $4$, a quarter to cluster $2$ and a quarter to cluster
$5$'').  So as not to confuse ourselves too much, we'll introduce a
new variable, $\vec z_n = \langle z_{n,1}, \dots, z_{n,K}$ (that sums
to one), to denote a fractional assignment of examples to clusters.

\TODOFigure{em:piecharts}{A figure showing pie charts}

This notion of soft-assignments is visualized in
Figure~\ref{fig:em:piecharts}.  Here, we've depicted each example as a
pie chart, and it's coloring denotes the degree to which it's been
assigned to each (of three) clusters.  The size of the pie pieces
correspond to the $\vec z_n$ values.

Formally, $z_{n,k}$ denotes the probability that example $n$ is
assigned to cluster $k$:
%
\begin{align}
z_{n,k} &= p(y_n = k \| \vx_n) \\
&= \frac {p(y_n = k, \vx_n)} {p(\vx_n)} \\
&= \frac 1 {Z_n} \Mult(k \| \vec\th) \Nor(\vx_n \| \vec\mu_k, \si^2_k)
\end{align}
%
Here, the normalizer $Z_n$ is to ensure that $\vec z_n$ sums to one.

Given a set of parameters (the $\vec\th$s, $\vec\mu$s and $\si^2$s),
the \concept{fractional assignments} $z_{n,k}$ are easy to compute.
Now, akin to $K$-means, given fractional assignments, you need to
recompute estimates of the model parameters.  In analogy to the
maximum likelihood solution (Eqs~\eqref{}-\eqref{}), you can do this
by counting fractional points rather than full points.  This gives the
following re-estimation updates:
%
\begin{align}
\th_k &= \text{fraction of training examples in class $k$} \\
&= \frac 1 N \sum_n z_{n,k} \nonumber\\
\vec\mu_k &= \text{mean of fractional examples in class $k$} \\
&= \frac {\sum_n z_{n,k} \vx_n} {\sum_n z_{n,k}} \nonumber\\
\si^2_k &= \text{variance of fractional examples in class $k$} \\
&= \frac {\sum_n z_{n,k} \norm{\vx_n-\mu_k}} {\sum_n z_{n,k}} \nonumber
\end{align}
%
All that has happened here is that the hard assignments ``$[y_n=k]$''
have been replaced with soft assignments ``$z_{n,k}$''.  As a bit of
foreshadowing of what is to come, what we've done is essentially
replace known labels with \emph{expected labels,} hence the name
``expectation maximization.''

\newalgorithm%
  {em:gmm}%
  {\FUN{GMM}(\VAR{$\mat X$}, \VAR{K})}
  {
\FOR{$\VAR{k} = \CON{1}$ \TO $\VAR{K}$}
\SETST{$\vec\mu_k$}{some random location} \COMMENT{randomly initialize
  mean for $k$th cluster}
\SETST{$\si^2_k$}{1} \COMMENT{initialize variances}
\SETST{$\th_k$}{$1/K$} \COMMENT{each cluster equally likely a priori}
\ENDFOR
\REPEAT
\FOR{$\VAR{n} = \CON{1}$ \TO $\VAR{N}$}
\FOR{$\VAR{k} = \CON{1}$ \TO $\VAR{K}$}
\SETST{$z_{n,k}$}{$\VARm{\th_k} \left[2 \pi \VARm{\si_{k}^2}\right]^{-\frac D 2} \exp\left[-\frac 1 {2 \VARm{\si_{k}^2}} \norm{\VARm{\vx_n} - \VARm{\vec\mu_k}}^2\right]$}
  \COMMENT{compute (unnormalized) fractional assignments}
\ENDFOR
\SETST{$\vec z_n$}{$\frac 1 {\sum_k \VARm{z_{n,k}}} \VARm{\vec z_n}$}
  \COMMENT{normalize fractional assignments}
\ENDFOR
\FOR{$\VAR{k} = \CON{1}$ \TO $\VAR{K}$}
\SETST{$\th_k$}{$\frac 1 {\VARm{N}} \sum_n \VARm{z_{n,k}}$}
  \COMMENT{re-estimate prior probability of cluster $k$}
\SETST{$\vec\mu_k$}{$\frac {\sum_n \VARm{z_{n,k}} \VARm{\vx_n}} {\sum_n \VARm{z_{n,k}}}$}
  \COMMENT{re-estimate mean of cluster $k$}
\SETST{$\si^2_k$}{$\frac {\sum_n \VARm{z_{n,k}} \norm{\VARm{\vx_n}-\VARm{\mu_k}}} {\sum_n \VARm{z_{n,k}}}$}
  \COMMENT{re-estimate variance of cluster $k$}
\ENDFOR
\UNTIL{converged}
\RETURN \VAR{$\vec z$} \COMMENT{return cluster assignments}
}

Putting this together yields Algorithm~\ref{alg:em:gmm}.  This is the
\concept{GMM} (``\concept{Gaussian Mixture Models}'') algorithm,
because the probabilitic model being learned describes a dataset as
being drawn from a mixture distribution, where each component of this
distribution is a Gaussian.

\thinkaboutit{Aside from the fact that GMMs use soft assignments and $K$-means uses hard assignments, there are other differences between the two approaches.  What are they?}

Just as in the $K$-means algorithm, this approach is succeptible to
local optima and quality of initialization.  The heuristics for
computing better initializers for $K$-means are also useful here.

\section{The Expectation Maximization Framework}

At this point, you've seen a method for learning in a particular
probabilistic model with hidden variables.  Two questions remain: (1)
can you apply this idea more generally and (2) why is it even a
reasonable thing to do?  Expectation maximization is a \emph{family}
of algorithms for performing maximum likelihood estimation in
probabilistic models with hidden variables.

\TODOFigure{em:lowerbound}{A figure showing successive lower bounds}

The general flavor of how we will proceed is as follows.  We want to
maximize the log likelihood $\cL$, but this will turn out to be
difficult to do directly.  Instead, we'll pick a surrogate function
$\tilde\cL$ that's a lower bound on $\cL$ (i.e., $\tilde\cL \leq \cL$
everywhere) that's (hopefully) easier to maximize.  We'll construct
the surrogate in such a way that increasing it will force the true
likelihood to also go up.  After maximizing $\tilde\cL$, we'll
construct a \emph{new} lower bound and optimize that.  This process is
shown pictorially in Figure~\ref{fig:em:lowerbound}.

To proceed, consider an arbitrary probabilistic model $p(\vx,\vy \|
\vth$), where $\vx$ denotes the observed data, $\vy$ denotes the
hidden data and $\vth$ denotes the parameters.  In the case of
Gaussian Mixture Models, $\vx$ was the data points, $\vy$ was the
(unknown) labels and $\vth$ included the cluster prior probabilities,
the cluster means and the cluster variances.  Now, given access
\emph{only} to a number of examples $\vx_1, \dots, \vx_N$, you would
like to estimate the parameters ($\vth$) of the model.

Probabilistically, this means that some of the variables are unknown
and therefore you need to marginalize (or sum) over their possible
values.  Now, your data consists only of $\mat X = \langle \vx_1,
\vx_2, \dots, \vx_N \rangle$, not the $(\vx,y)$ pairs in $D$.  You can
then write the likelihood as:
%
\begin{align}
  p(\mat X \| \vth)
  &= \sum_{y_1} \sum_{y_2} \cdots \sum_{y_N}  p(\mat X, y_1, y_2, \dots y_N \| \vth)  \becauseof{marginalization}\\
  &= \sum_{y_1} \sum_{y_2} \cdots \sum_{y_N} \prod_n p(\vx_n, y_n \| \vth) \becauseof{examples are independent}\\
  &= \prod_n \sum_{y_n} p(\vx_n, y_n \| \vth) \becauseof{algebra}
\end{align}
%
At this point, the natural thing to do is to take logs and then start
taking gradients.  However, once you start taking logs, you run into a
problem: the log cannot eat the sum!
%
\begin{align}
  \cL(\mat X \| \vth)
  &= \sum_n \log \sum_{y_n} p(\vx_n, y_n \| \vth)
\end{align}
%
Namely, the log gets ``stuck'' outside the sum and cannot move in to
decompose the rest of the likelihood term!

The next step is to apply the somewhat strange, but strangely useful,
trick of multiplying by $1$.  In particular, let $q(\cdot)$ be an
arbitrary probability distribution.  We will multiply the $p(\dots)$
term above by $q(y_n) / q(y_n)$, a valid step so long as $q$ is never
zero.  This leads to:
%
\begin{align}
  \cL(\mat X \| \vth)
  &= \sum_n \log \sum_{y_n} q(y_n) \frac {p(\vx_n, y_n \| \vth)} {q(y_n)}
\end{align}
%
We will now construct a lower bound using \concept{Jensen's
  inequality}.  This is a very useful (and easy to prove!) result that
states that $f(\sum_i \la_i x_i) \geq \sum_i \la_i f(x_i)$, so long as
(a) $\la_i \geq 0$ for all $i$, (b) $\sum_i \la_i = 1$, and (c) $f$ is
concave.  If this looks familiar, that's just because it's a direct
result of the definition of \concept{concavity}.  Recall that $f$ is
concave if $f(a x + b y) \geq a f(x) + b f(x)$ whenever $a+b=1$.

\thinkaboutit{Prove Jensen's inequality using the definition of concavity and induction.}

You can now apply Jensen's inequality to the log likelihood by
identifying the list of $q(y_n)$s as the $\la$s, $\log$ as $f$ (which
is, indeed, concave) and each ``$x$'' as the $p/q$ term.  This yields:
%
\begin{align}
  \cL(\mat X \| \vth)
  &\geq \sum_n \sum_{y_n} q(y_n) \log \frac {p(\vx_n, y_n \| \vth)} {q(y_n)}\\
  &= \sum_n \sum_{y_n} \Big[ q(y_n) \log p(\vx_n, y_n \| \vth) - q(y_n) \log q(y_n)\Big]\\
  &\defeq \tilde\cL(\mat X \| \vth)
\end{align}
%
Note that this inequality holds for \emph{any} choice of function $q$,
so long as its non-negative and sums to one.  In particular, it
needn't even by the same function $q$ for each $n$.  We will need to
take advantage of both of these properties.

We have succeeded in our first goal: constructing a lower bound on
$\cL$.  When you go to optimize this lower bound for $\vth$, the only
part that matters is the first term.  The second term, $q \log q$,
drops out as a function of $\vth$.  This means that the the
maximization you need to be able to compute, for fixed $q_n$s, is:
%
\begin{align}
  \vth\newth \leftarrow \arg\max_{\vth} \sum_n \sum_{y_n} q_n(y_n) \log p(\vx_n, y_n \| \vth)
\end{align}
%
This is \emph{exactly} the sort of maximization done for Gaussian
mixture models when we recomputed new means, variances and cluster
prior probabilities.

The second question is: what should $q_n(\cdot)$ actually be?  Any
reasonable $q$ will lead to a lower bound, so in order to choose one
$q$ over another, we need another criterion.  Recall that we are
hoping to maximize $\cL$ by instead maximizing a lower bound.  In
order to ensure that an increase in the lower bound implies an
increase in $\cL$, we need to ensure that $\cL(\mat X \| \vth) =
\tilde\cL(\mat X \| \vth)$.  In words: $\tilde\cL$ should be a lower
bound on $\cL$ that makes contact at the current point, $\vth$.  
%This
%is shown in Figure~\ref{fig:em:lb}, including a case where the lower
%bound does \emph{not} make contact, and thereby does not guarantee an
%increase in $\cL$ with an increase in $\tilde\cL$.

% \section{EM versus Gradient Descent}

% computing gradients through marginals

% step size

% \section{Dimensionality Reduction with Probabilistic PCA}

% derivation

% advantages over pca

\section{Further Reading}

TODO further reading


\begin{comment}
   - Latent variable models (versus parameters)
   - Marginalization, expectations
   - Gaussian mixture models as naive Bayes without labels
   - Expectation maximization in general
   - EM vs gradient descent
   - ``hard'' em
\end{comment}



%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "courseml"
%%% End: 
